17 research outputs found

    Statistical approaches to the study of protein folding and energetics

    The determination of protein structure and the exploration of protein folding landscapes are two of the key problems in computational biology. In order to address these challenges, both a protein model that accurately captures the physics of interest and an efficient sampling algorithm are required. The first part of this thesis documents the continued development of CRANKITE, a coarse-grained protein model, and its energy landscape exploration using nested sampling, a Bayesian sampling algorithm. We extend CRANKITE and optimize its parameters using a maximum likelihood approach. The efficiency of our procedure, using the contrastive divergence approximation, allows a large training set to be used, producing a model which is transferable to proteins not included in the training set. We develop an empirical Bayes model for the prediction of protein β-contacts, which are required inputs for CRANKITE. Our approach couples the constraints and prior knowledge associated with β-contacts to a maximum entropy-based statistic which predicts evolutionarily related contacts. Nested sampling (NS) is a Bayesian algorithm shown to be efficient at sampling systems which exhibit a first-order phase transition. In this work we parallelize the algorithm and, for the first time, apply it to a biophysical system: small globular proteins modelled using CRANKITE. We generate energy landscape charts, which give a large-scale visualization of the protein folding landscape, and we compare the efficiency of NS to an alternative sampling technique, parallel tempering, when calculating the heat capacity of a short peptide. In the final part of the thesis we adapt the NS algorithm for use within a molecular dynamics framework and demonstrate the application of the algorithm by calculating the thermodynamics of all-atom models of a small peptide, comparing results to the standard replica exchange approach. This adaptation will allow NS to be used with more realistic force fields in the future.
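
    To make the nested sampling idea above concrete, the following is a minimal sketch on a toy one-dimensional problem, not the parallel CRANKITE implementation described in the thesis. The function name, the Gaussian likelihood, the uniform prior on [-10, 10] and all parameter values are assumptions chosen for illustration. Because the likelihood decreases with |x|, a replacement point above the current likelihood threshold can be drawn exactly, which a real protein model would not allow.

```python
import math
import random


def nested_sampling_evidence(n_live=500, n_iter=4000, half_width=10.0, seed=0):
    """Estimate Z = integral of L(x) * prior(x) dx for L(x) = exp(-x^2 / 2)
    with a uniform prior on [-half_width, half_width] (toy assumptions)."""
    rng = random.Random(seed)

    def likelihood(x):
        return math.exp(-0.5 * x * x)

    # Start with live points drawn from the prior.
    live = [rng.uniform(-half_width, half_width) for _ in range(n_live)]
    evidence = 0.0
    x_prev = 1.0  # prior mass enclosed by the current likelihood contour

    for i in range(1, n_iter + 1):
        # The lowest-likelihood live point is the one with the largest |x|.
        worst = max(range(n_live), key=lambda k: abs(live[k]))
        l_star = likelihood(live[worst])

        # Expected shrinkage of the enclosed prior mass after i iterations.
        x_curr = math.exp(-i / n_live)
        evidence += l_star * (x_prev - x_curr)
        x_prev = x_curr

        # Replace the discarded point by a prior draw with L > l_star.
        bound = abs(live[worst])
        live[worst] = rng.uniform(-bound, bound)

    # Add the contribution of the remaining live points.
    evidence += x_prev * sum(likelihood(x) for x in live) / n_live
    return evidence


if __name__ == "__main__":
    exact = math.sqrt(2.0 * math.pi) * math.erf(10.0 / math.sqrt(2.0)) / 20.0
    print(nested_sampling_evidence(), exact)
```

    In a realistic setting the exact constrained draw would be replaced by a Monte Carlo walk restricted to the region above the likelihood (or below the energy) threshold; the bookkeeping of the shrinking prior mass stays the same.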

    Improving protein-protein interaction prediction using evolutionary information from low-quality MSAs.

    Evolutionary information stored in multiple sequence alignments (MSAs) has been used to identify the interaction interface of protein complexes, by measuring either co-conservation or co-mutation of amino acid residues across the interface. Recently, maximum entropy related correlated mutation measures (CMMs) such as direct information, decoupling direct from indirect interactions, have been developed to identify residue pairs interacting across the protein complex interface. These studies have focussed on carefully selected protein complexes with large, good-quality MSAs. In this work, we study protein complexes with a more typical MSA consisting of fewer than 400 sequences, using a set of 79 intramolecular protein complexes. Using a maximum entropy based CMM at the residue level, we develop an interface level CMM score to be used in re-ranking docking decoys. We demonstrate that our interface level CMM score compares favourably to the complementarity trace score, an evolutionary information-based score measuring co-conservation, when combined with the number of interface residues, a knowledge-based potential and the variability score of individual amino acid sites. We also demonstrate that, since co-mutation and co-complementarity in the MSA contain orthogonal information, the best prediction performance using evolutionary information can be achieved by combining the co-mutation information of the CMM with co-conservation information of a complementarity trace score, predicting a near-native structure as the top prediction for 41% of the dataset. The method presented is not restricted to small MSAs, and will likely improve interface prediction also for complexes with large and good-quality MSAs.
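
    As an illustration of what a correlated mutation measure computes, the sketch below scores a pair of alignment columns by plain mutual information. This is a deliberately simpler stand-in: it is not the maximum entropy-based direct information used in the paper, and it does not separate direct from indirect couplings. The function name and the toy columns are hypothetical.

```python
import math
from collections import Counter


def mutual_information(col_i, col_j):
    """Mutual information (in nats) between two MSA columns.

    Each column is a string with one residue letter per aligned sequence.
    No sequence weighting or pseudocounts are applied (an assumption made
    to keep the sketch short)."""
    if len(col_i) != len(col_j) or not col_i:
        raise ValueError("columns must be non-empty and of equal length")

    n = len(col_i)
    freq_i = Counter(col_i)
    freq_j = Counter(col_j)
    freq_ij = Counter(zip(col_i, col_j))

    mi = 0.0
    for (a, b), count in freq_ij.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((freq_i[a] / n) * (freq_j[b] / n)))
    return mi


if __name__ == "__main__":
    # Perfectly co-varying columns versus independent columns.
    print(mutual_information("AAVV", "LLII"))
    print(mutual_information("AAVV", "LILI"))
```

    With small alignments such as those studied here, raw frequency estimates of this kind are noisy, which is one reason the residue-level scores are combined into an interface-level score rather than used individually.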

    Predicting protein β-sheet contacts using a maximum entropy-based correlated mutation measure

    Motivation: The problem of ab initio protein folding is one of the most difficult in modern computational biology. The prediction of residue contacts within a protein provides a more tractable immediate step. Recently introduced maximum entropy-based correlated mutation measures (CMMs), such as direct information, have been successful in predicting residue contacts. However, most correlated mutation studies focus on proteins that have large good-quality multiple sequence alignments (MSAs) because the power of correlated mutation analysis falls as the size of the MSA decreases. Nevertheless, even with small autogenerated MSAs, maximum entropy-based CMMs contain information. To make use of this information, in this article, we focus not on general residue contacts but on contacts between residues in β-sheets. The strong constraints and prior knowledge associated with β-contacts are ideally suited for prediction using a method that incorporates an often noisy CMM. Results: Using contrastive divergence, a statistical machine learning technique, we have calculated a maximum entropy-based CMM. We have integrated this measure with a new probabilistic model for β-contact prediction, which is used to predict both residue- and strand-level contacts. Using our model on a standard non-redundant dataset, we significantly outperform a 2D recurrent neural network architecture, achieving a 5% improvement in true positives at the 5% false-positive rate at the residue level. At the strand level, our approach is competitive with the state-of-the-art single methods, achieving precision of 61.0% and recall of 55.4%, while not requiring residue solvent accessibility as an input.

    The effect of co-conservation and co-evolution on the interface prediction.

    <p>Left: The fraction of proteins for which a near-native decoy is in the top scored predictions, as a function of the number of decoys considered, for the <i>S</i>(<i>S</i><sup>RP</sup>, <i>S</i><sup>N</sup>, <i>S</i><sup>ent</sup>) (grey solid line), <i>S</i>(<i>S</i><sup>RP</sup>, <i>S</i><sup>N</sup>, <i>S</i><sup>ent</sup>, <i>S</i><sup>CMM</sup>) (black solid line), <i>S</i>(<i>S</i><sup>RP</sup>, <i>S</i><sup>N</sup>, <i>S</i><sup>ent</sup>, <i>S</i><sup>CT</sup>) (grey dashed line), <i>S</i>(<i>S</i><sup>RP</sup>, <i>S</i><sup>N</sup>, <i>S</i><sup>ent</sup>, <i>S</i><sup>CT</sup>, <i>S</i><sup>CMM</sup>) (black dashed line) and (light grey dash-dotted line) scoring functions. Right: The number of proteins for which the rank of the top near-native prediction is within the top 1, 5 or 10 predictions, for the <i>S</i>(<i>S</i><sup>RP</sup>, <i>S</i><sup>N</sup>, <i>S</i><sup>ent</sup>) (solid black bars), <i>S</i>(<i>S</i><sup>RP</sup>, <i>S</i><sup>N</sup>, <i>S</i><sup>ent</sup>, <i>S</i><sup>CMM</sup>) (solid grey bars), <i>S</i>(<i>S</i><sup>RP</sup>, <i>S</i><sup>N</sup>, <i>S</i><sup>ent</sup>, <i>S</i><sup>CT</sup>) (dark checked bars) and <i>S</i>(<i>S</i><sup>RP</sup>, <i>S</i><sup>N</sup>, <i>S</i><sup>ent</sup>, <i>S</i><sup>CT</sup>, <i>S</i><sup>CMM</sup>) (light checked bars) scoring functions.</p>

    MSAs in the dataset.

    <p>Left: The cumulative distribution function of protein complexes in the dataset as a function of the number of sequences in their MSA. 95% of protein complexes have fewer than 400 sequences. Right: The effective number of sequences as a function of the number of amino acids in the protein complexes studied.</p>

    Comparison of the interface-level scoring functions using CMM.

    <p>The fraction of proteins for which there is at least one near-native complex in the top predictions, for the scoring functions <i>S</i><sup>CMM</sup> (black dash-dotted line), (light grey dash-dotted line), <i>S</i>(<i>S</i><sup>RP</sup>, <i>S</i><sup>N</sup>, <i>S</i><sup>ent</sup>) (grey solid line), <i>S</i>(<i>S</i><sup>RP</sup>, <i>S</i><sup>N</sup>, <i>S</i><sup>ent</sup>, <i>S</i><sup>CMM</sup>) (black solid line) and (light grey solid line).</p>

    Probability distribution of the residue-level CMM scores.

    <p>The distribution of the standardised <i>Z</i>(<i>i</i>, <i>j</i>) scores for all residues (solid line) and for the interface residues of the native structure (dashed line). Left: probability distribution function. Right: cumulative distribution function. Dash-dotted line shows 0, the mean of the standardised scores.</p>
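
    The caption above refers to standardised scores whose mean is 0. A minimal sketch of such a standardisation is given below; the function name is hypothetical, and the use of the population standard deviation over all residue pairs is an assumption, since the caption does not say how the scores were scaled.

```python
import statistics


def standardise(scores):
    """Return z-scores (mean 0, unit population standard deviation)."""
    mean = statistics.fmean(scores)
    spread = statistics.pstdev(scores)
    if spread == 0.0:
        raise ValueError("all scores are identical; z-scores are undefined")
    return [(s - mean) / spread for s in scores]


if __name__ == "__main__":
    raw = [0.12, 0.05, 0.40, 0.08, 0.31]  # made-up residue-pair scores
    print(standardise(raw))
```

    After this transformation, a shift of the interface-residue distribution to the right of 0 indicates that interface pairs tend to score above the average pair.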

    Efficient Parameter Estimation of Generalizable Coarse-Grained Protein Force Fields Using Contrastive Divergence: A Maximum Likelihood Approach

    Maximum Likelihood (ML) optimization schemes are widely used for parameter inference. They maximize the likelihood of some experimentally observed data with respect to the model parameters iteratively, following the gradient of the logarithm of the likelihood. Here, we employ an ML inference scheme to infer a generalizable, physics-based coarse-grained protein model (which includes Go̅-like biasing terms to stabilize secondary structure elements in room-temperature simulations), using native conformations of a training set of proteins as the observed data. Contrastive divergence, a novel statistical machine learning technique, is used to efficiently approximate the direction of the gradient ascent, which enables the use of a large training set of proteins. Unlike previous work, the generalizability of the protein model allows the folding of peptides and a protein (protein G) which are not part of the training set. We compare the same force field with different van der Waals (vdW) potential forms: a hard cutoff model, and a Lennard-Jones (LJ) potential with vdW parameters inferred or adopted from the CHARMM or AMBER force fields. Simulations of peptides and protein G show that the LJ model with inferred parameters outperforms the hard cutoff potential, which is consistent with previous observations. Simulations using the LJ potential with inferred vdW parameters also outperform the protein models with adopted vdW parameter values, demonstrating that model parameters generally cannot be used with force fields with different energy functions. The software is available at https://sites.google.com/site/crankite/
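
    The gradient-following scheme described above can be sketched on a toy model. For a Boltzmann-type model p(x) ∝ exp(-E(x; θ)), the gradient of the log-likelihood is the model average of ∂E/∂θ minus its data average; contrastive divergence replaces the model average by an average over configurations obtained from a few Monte Carlo steps started at the data. The example below assumes a one-parameter energy E = θx²/2 and a single Metropolis step per data point. All names and settings are hypothetical and unrelated to the CRANKITE code.

```python
import math
import random


def metropolis_step(x, theta, rng, width=1.0):
    """One Metropolis move for the toy energy E(x) = theta * x^2 / 2."""
    proposal = x + rng.uniform(-width, width)
    delta_e = 0.5 * theta * (proposal * proposal - x * x)
    if delta_e <= 0.0 or rng.random() < math.exp(-delta_e):
        return proposal
    return x


def fit_precision_cd(data, theta=1.0, learning_rate=1.0, n_iter=300, seed=0):
    """Fit theta by gradient ascent with a CD-1 gradient estimate."""
    rng = random.Random(seed)
    n = len(data)
    # Data average of dE/dtheta = x^2 / 2.
    data_stat = sum(x * x for x in data) / (2 * n)
    for _ in range(n_iter):
        # Short chains started at the data approximate the model average.
        moved = [metropolis_step(x, theta, rng) for x in data]
        model_stat = sum(x * x for x in moved) / (2 * n)
        theta += learning_rate * (model_stat - data_stat)
    return theta


if __name__ == "__main__":
    rng = random.Random(1)
    # Synthetic "observed data" with true precision 2 (variance 0.5).
    data = [rng.gauss(0.0, math.sqrt(0.5)) for _ in range(1000)]
    print(fit_precision_cd(data), len(data) / sum(x * x for x in data))
```

    The second printed number is the closed-form maximum likelihood estimate for this toy model, which the contrastive divergence estimate should approach; for a protein force field no closed form exists, which is where the short-chain approximation pays off.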